For this homework, the datasets are obtained from the UCI Machine Learning Repository and Kaggle.
Each dataset is divided into train and test sets: 2/3 of the original data forms the train set and the remaining 1/3 forms the test set.
The models to be used are as follows.
- Penalized Regression Approaches (PRA)
- Decision Trees (DT)
- Random Forests (RF)
- Stochastic Gradient Boosting (SGB)
The necessary libraries are as follows.
# libraries
suppressMessages(library(readr))
suppressMessages(library(readxl))
suppressMessages(library(glmnet))
suppressMessages(library(Metrics))
suppressMessages(library(rpart))
suppressMessages(library(rattle))
suppressMessages(library(stats))
suppressMessages(library(e1071))
suppressMessages(library(caret))
suppressMessages(library(randomForest))
suppressMessages(library(gbm))
After reading the dataset, train and test sets are created.
# reading dataset 1
health_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/FetalHealth.csv"))
health_data$fetal_health <- as.factor(health_data$fetal_health)
# creating train and test sets for dataset 1
set.seed(1)
health_index <- sample(1:nrow(health_data), (2/3) * nrow(health_data))
health_train <- health_data[health_index, ]
health_test <- health_data[-health_index, ]
paste("Total:", nrow(health_data), " Train:", nrow(health_train),
" Test:", nrow(health_test))
## [1] "Total: 2126 Train: 1417 Test: 709"
To determine the Lasso penalty parameter, lambda, 10-fold cross-validation is used.
set.seed(2)
health_cv_fit <- cv.glmnet(as.matrix(health_train[, -22]), health_train$fetal_health,
family = "multinomial", nfolds = 10)
health_cv_fit
##
## Call: cv.glmnet(x = as.matrix(health_train[, -22]), y = health_train$fetal_health, nfolds = 10, family = "multinomial")
##
## Measure: Multinomial Deviance
##
## Lambda Measure SE Nonzero
## min 0.001162 0.4645 0.03179 13
## 1se 0.006200 0.4942 0.02579 7
cat(" Lambda min:", health_cv_fit$lambda.min, "\n", "Lambda 1se:",
health_cv_fit$lambda.1se)
## Lambda min: 0.00116184
## Lambda 1se: 0.006200392
plot(health_cv_fit)
Based on the CV results, the minimum lambda value, 0.001162, is selected for use in the Penalized Regression Approach.
health_pra <- glmnet(as.matrix(health_train[, -22]), health_train$fetal_health,
family = "multinomial", lambda = health_cv_fit$lambda.min)
health_pra_pred <- data.frame(predict(health_pra, as.matrix(health_test[,
-22]), type = "class"))
health_pra_pred$s0 <- as.factor(health_pra_pred$s0)
confusionMatrix(health_pra_pred$s0, health_test$fetal_health)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 508 29 5
## 2 34 71 10
## 3 5 4 43
##
## Overall Statistics
##
## Accuracy : 0.8773
## 95% CI : (0.8509, 0.9005)
## No Information Rate : 0.7715
## P-Value [Acc > NIR] : 5.152e-13
##
## Kappa : 0.6774
##
## Mcnemar's Test P-Value : 0.3965
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9287 0.6827 0.74138
## Specificity 0.7901 0.9273 0.98618
## Pos Pred Value 0.9373 0.6174 0.82692
## Neg Pred Value 0.7665 0.9444 0.97717
## Prevalence 0.7715 0.1467 0.08181
## Detection Rate 0.7165 0.1001 0.06065
## Detection Prevalence 0.7645 0.1622 0.07334
## Balanced Accuracy 0.8594 0.8050 0.86378
Fetal health values are predicted with the Penalized Regression model. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.8773, so about 88% of the test set is predicted correctly.
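The reported accuracy can be checked directly from the confusion matrix: it is the sum of the diagonal (the correct predictions) divided by the total number of test observations. A minimal sketch, using the counts printed above:

```r
# confusion matrix counts from the output above (rows = predicted, cols = actual)
cm <- matrix(c(508, 29,  5,
                34, 71, 10,
                 5,  4, 43),
             nrow = 3, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)  # 0.8773
```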
In the Decision Trees, the minimal number of observations per tree leaf and the complexity parameter are tuned with cross-validation.
For the minimal number of observations per tree leaf, 10, 15, 20, 25, 30 and 35 are tried, and for the complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are tried.
set.seed(3)
health_dt_minbucket <- tune.rpart(fetal_health ~ ., data = health_train,
minbucket = seq(10, 35, 5))
plot(health_dt_minbucket, main = "Performance of rpart vs. minbucket")
health_dt_minbucket$best.parameters$minbucket
## [1] 10
health_dt_cp <- tune.rpart(fetal_health ~ ., data = health_train,
cp = seq(0.005, 0.03, 0.005))
plot(health_dt_cp, main = "Performance of rpart vs. cp")
health_dt_cp$best.parameters$cp
## [1] 0.015
As the best parameter values, the minimal number of observations per tree leaf is 10 and the complexity parameter is 0.015.
health_dt <- rpart(fetal_health ~ ., data = health_train, method = "class",
control = rpart.control(minbucket = health_dt_minbucket$best.parameters$minbucket,
cp = health_dt_cp$best.parameters$cp))
fancyRpartPlot(health_dt)
health_dt_pred <- data.frame(predict(health_dt, health_test[,
-22], type = "class"))
colnames(health_dt_pred) <- "s0"
health_dt_pred$s0 <- as.factor(health_dt_pred$s0)
confusionMatrix(health_dt_pred$s0, health_test$fetal_health)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 531 41 8
## 2 7 62 1
## 3 9 1 49
##
## Overall Statistics
##
## Accuracy : 0.9055
## 95% CI : (0.8815, 0.926)
## No Information Rate : 0.7715
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7281
##
## Mcnemar's Test P-Value : 2.333e-05
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9707 0.59615 0.84483
## Specificity 0.6975 0.98678 0.98464
## Pos Pred Value 0.9155 0.88571 0.83051
## Neg Pred Value 0.8760 0.93427 0.98615
## Prevalence 0.7715 0.14669 0.08181
## Detection Rate 0.7489 0.08745 0.06911
## Detection Prevalence 0.8181 0.09873 0.08322
## Balanced Accuracy 0.8341 0.79147 0.91473
Fetal health values are predicted with the Decision Tree model. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9055, so about 91% of the test set is predicted correctly.
set.seed(4)
health_rf <- randomForest(data.matrix(health_train[, -22]), health_train$fetal_health,
ntree = 500, nodesize = 5)
health_rf$mtry
## [1] 4
With the default parameters of Random Forest, each split considers a random sample of 4 features (the default mtry for classification). For this parameter, 2, 4, 6, 8, 10 and 12 are tried while tuning.
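The default value of 4 reported above follows from randomForest's default for classification, the floor of the square root of the number of predictors; with the 21 predictors in this dataset that gives 4. A quick check:

```r
# default mtry for classification in randomForest: floor(sqrt(number of predictors))
p <- 21                       # predictors in the fetal health data
default_mtry <- floor(sqrt(p))
default_mtry  # 4
```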
fitControl <- trainControl(method = "repeatedcv", number = 3,
repeats = 2, search = "grid")
tunegrid <- expand.grid(.mtry = seq(2, 12, 2))
health_rf <- train(fetal_health ~ ., data = health_train, method = "rf",
metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(health_rf)
## Random Forest
##
## 1417 samples
## 21 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times)
## Summary of sample sizes: 946, 944, 944, 945, 944, 945, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9315478 0.8016754
## 4 0.9421344 0.8353148
## 6 0.9438954 0.8401843
## 8 0.9438962 0.8411073
## 10 0.9435431 0.8402300
## 12 0.9456647 0.8458604
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
plot(health_rf)
According to the printed results, caret selects mtry = 12 with an accuracy of 0.9457; however, the accuracies are nearly flat for mtry >= 6 (0.9439 at mtry = 6), so mtry = 6 is used for the final model.
health_rf <- randomForest(health_train[, -22], health_train$fetal_health,
ntree = 500, nodesize = 5, mtry = 6)
health_rf
##
## Call:
## randomForest(x = health_train[, -22], y = health_train$fetal_health, ntree = 500, mtry = 6, nodesize = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 5.15%
## Confusion matrix:
## 1 2 3 class.error
## 1 1087 18 3 0.01895307
## 2 38 151 2 0.20942408
## 3 5 7 106 0.10169492
The OOB estimate of the error rate is 5.15%. Also, the class error of 1 (normal) is about 2%, the class error of 2 (suspect) is about 21% and the class error of 3 (pathological) is about 10%. This difference can be the result of class imbalance in the dataset: most of the observations belong to class 1.
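The per-class error rates follow directly from the OOB confusion matrix: each is the number of misclassified observations in a class divided by that class's row total. A minimal sketch using the counts printed above:

```r
# OOB confusion counts from the output above (rows = actual classes 1, 2, 3)
oob <- matrix(c(1087,  18,   3,
                  38, 151,   2,
                   5,   7, 106),
              nrow = 3, byrow = TRUE)
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)  # 0.0190 0.2094 0.1017
```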
varImpPlot(health_rf)
According to the variable importance plot, the mean value of short term variability, abnormal short term variability and percentage of time with abnormal long term variability are the most important features and have the largest effect on the Gini index.
health_rf_pred <- predict(health_rf, health_test[, -22], type = "class")
confusionMatrix(health_rf_pred, health_test$fetal_health)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 532 31 4
## 2 10 72 2
## 3 5 1 52
##
## Overall Statistics
##
## Accuracy : 0.9252
## 95% CI : (0.9034, 0.9435)
## No Information Rate : 0.7715
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7917
##
## Mcnemar's Test P-Value : 0.01069
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9726 0.6923 0.89655
## Specificity 0.7840 0.9802 0.99078
## Pos Pred Value 0.9383 0.8571 0.89655
## Neg Pred Value 0.8944 0.9488 0.99078
## Prevalence 0.7715 0.1467 0.08181
## Detection Rate 0.7504 0.1016 0.07334
## Detection Prevalence 0.7997 0.1185 0.08181
## Balanced Accuracy 0.8783 0.8362 0.94367
Fetal health values are predicted with Random Forest. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9252, so about 93% of the test set is predicted correctly.
In the Stochastic Gradient Boosting, the depth of the trees, the learning rate and the number of trees are tuned with cross-validation.
For the tree depth, 1, 2 and 3 are tried, for the learning rate 0.001, 0.005 and 0.01 are tried, and for the number of trees 50, 100 and 150 are tried. The minimal number of observations per tree leaf is held at 10.
set.seed(5)
fitControl <- trainControl(method = "repeatedcv", number = 5,
repeats = 3, verboseIter = FALSE, summaryFunction = multiClassSummary,
allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001,
0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- capture.output(health_gbm <- train(fetal_health ~
., data = health_train, method = "gbm", trControl = fitControl,
tuneGrid = tunegrid))
print(health_gbm)
## Stochastic Gradient Boosting
##
## 1417 samples
## 21 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1134, 1133, 1135, 1133, 1133, 1134, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees Accuracy Kappa Mean_F1
## 0.001 1 50 0.8788676 0.6327877 0.7169187
## 0.001 1 100 0.8788701 0.6310139 0.7162940
## 0.001 1 150 0.8791048 0.6315316 0.7172609
## 0.001 2 50 0.9047458 0.7198222 0.8136787
## 0.001 2 100 0.9045152 0.7185403 0.8120870
## 0.001 2 150 0.9054541 0.7205947 0.8153378
## 0.001 3 50 0.9134470 0.7456027 0.8366522
## 0.001 3 100 0.9146282 0.7495203 0.8391715
## 0.001 3 150 0.9141596 0.7471088 0.8366850
## 0.005 1 50 0.8800413 0.6349879 0.7175838
## 0.005 1 100 0.8802836 0.6329188 0.7187789
## 0.005 1 150 0.8833427 0.6404088 0.7256026
## 0.005 2 50 0.9071073 0.7241840 0.8183971
## 0.005 2 100 0.9101598 0.7318914 0.8236594
## 0.005 2 150 0.9132156 0.7409628 0.8300266
## 0.005 3 50 0.9155755 0.7507752 0.8398699
## 0.005 3 100 0.9191016 0.7613765 0.8468611
## 0.005 3 150 0.9247396 0.7787165 0.8570079
## 0.010 1 50 0.8814564 0.6368102 0.7223253
## 0.010 1 100 0.8878127 0.6539154 0.7393570
## 0.010 1 150 0.9035687 0.7037120 0.7979462
## 0.010 2 50 0.9122783 0.7387567 0.8284104
## 0.010 2 100 0.9198008 0.7629265 0.8458993
## 0.010 2 150 0.9256686 0.7824400 0.8605128
## 0.010 3 50 0.9183932 0.7588268 0.8444687
## 0.010 3 100 0.9263819 0.7842595 0.8614184
## 0.010 3 150 0.9320182 0.8029033 0.8738376
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.6741869 0.8748725 0.8240197 0.9226691
## 0.6729861 0.8735491 0.8231083 0.9239785
## 0.6734671 0.8733659 0.8239362 0.9244624
## 0.7817941 0.8965016 0.8576302 0.9307704
## 0.7794450 0.8961497 0.8570856 0.9316158
## 0.7819187 0.8954298 0.8618817 0.9323942
## 0.8100817 0.9020178 0.8756090 0.9376116
## 0.8121250 0.9035377 0.8772682 0.9383712
## 0.8086914 0.9020330 0.8771252 0.9391352
## 0.6744508 0.8750543 0.8234946 0.9257624
## 0.6726294 0.8716323 0.8283817 0.9269136
## 0.6764100 0.8722870 0.8350154 0.9312607
## 0.7829783 0.8947120 0.8687553 0.9351144
## 0.7856798 0.8950572 0.8776225 0.9393335
## 0.7913687 0.8978328 0.8846436 0.9424546
## 0.8112061 0.9020474 0.8824858 0.9409975
## 0.8183328 0.9055392 0.8894916 0.9442760
## 0.8283022 0.9122522 0.8977370 0.9494469
## 0.6758310 0.8731517 0.8317869 0.9280922
## 0.6881381 0.8748241 0.8434804 0.9353116
## 0.7456249 0.8835748 0.8798007 0.9433789
## 0.7916976 0.8980346 0.8807255 0.9413711
## 0.8116838 0.9067995 0.8927703 0.9465946
## 0.8306779 0.9151898 0.9000709 0.9487598
## 0.8148972 0.9044639 0.8877499 0.9443435
## 0.8343105 0.9147517 0.8994646 0.9500495
## 0.8511530 0.9230663 0.9056160 0.9523427
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.8240197 0.6741869 0.2929559 0.7745297
## 0.8231083 0.6729861 0.2929567 0.7732676
## 0.8239362 0.6734671 0.2930349 0.7734165
## 0.8576302 0.7817941 0.3015819 0.8391479
## 0.8570856 0.7794450 0.3015051 0.8377974
## 0.8618817 0.7819187 0.3018180 0.8386743
## 0.8756090 0.8100817 0.3044823 0.8560498
## 0.8772682 0.8121250 0.3048761 0.8578314
## 0.8771252 0.8086914 0.3047199 0.8553622
## 0.8234946 0.6744508 0.2933471 0.7747526
## 0.8283817 0.6726294 0.2934279 0.7721308
## 0.8350154 0.6764100 0.2944476 0.7743485
## 0.8687553 0.7829783 0.3023691 0.8388451
## 0.8776225 0.7856798 0.3033866 0.8403685
## 0.8846436 0.7913687 0.3044052 0.8446007
## 0.8824858 0.8112061 0.3051918 0.8566267
## 0.8894916 0.8183328 0.3063672 0.8619360
## 0.8977370 0.8283022 0.3082465 0.8702772
## 0.8317869 0.6758310 0.2938188 0.7744914
## 0.8434804 0.6881381 0.2959376 0.7814811
## 0.8798007 0.7456249 0.3011896 0.8145998
## 0.8807255 0.7916976 0.3040928 0.8448661
## 0.8927703 0.8116838 0.3066003 0.8592416
## 0.9000709 0.8306779 0.3085562 0.8729338
## 0.8877499 0.8148972 0.3061311 0.8596805
## 0.8994646 0.8343105 0.3087940 0.8745311
## 0.9056160 0.8511530 0.3106727 0.8871097
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(health_gbm)
According to the results, the final values used for the model are n.trees = 150, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 10.
health_gbm_final = suppressWarnings(gbm(fetal_health ~ ., data = health_train,
distribution = "multinomial", n.trees = 150, interaction.depth = 3,
n.minobsinnode = 10, shrinkage = 0.01))
summary(health_gbm_final)
## var
## abnormal_short_term_variability abnormal_short_term_variability
## percentage_of_time_with_abnormal_long_term_variability percentage_of_time_with_abnormal_long_term_variability
## histogram_mean histogram_mean
## mean_value_of_short_term_variability mean_value_of_short_term_variability
## prolongued_decelerations prolongued_decelerations
## accelerations accelerations
## histogram_mode histogram_mode
## `baseline value` `baseline value`
## histogram_min histogram_min
## uterine_contractions uterine_contractions
## histogram_median histogram_median
## histogram_max histogram_max
## histogram_number_of_peaks histogram_number_of_peaks
## mean_value_of_long_term_variability mean_value_of_long_term_variability
## histogram_width histogram_width
## histogram_variance histogram_variance
## fetal_movement fetal_movement
## light_decelerations light_decelerations
## severe_decelerations severe_decelerations
## histogram_number_of_zeroes histogram_number_of_zeroes
## histogram_tendency histogram_tendency
## rel.inf
## abnormal_short_term_variability 21.950351068
## percentage_of_time_with_abnormal_long_term_variability 20.419051435
## histogram_mean 16.219250601
## mean_value_of_short_term_variability 15.975734410
## prolongued_decelerations 6.252649044
## accelerations 5.143927031
## histogram_mode 2.929799121
## `baseline value` 2.637718803
## histogram_min 1.979950300
## uterine_contractions 1.516021868
## histogram_median 1.472416703
## histogram_max 1.044207516
## histogram_number_of_peaks 0.684141734
## mean_value_of_long_term_variability 0.668165018
## histogram_width 0.621640302
## histogram_variance 0.418581419
## fetal_movement 0.061195631
## light_decelerations 0.005197997
## severe_decelerations 0.000000000
## histogram_number_of_zeroes 0.000000000
## histogram_tendency 0.000000000
Abnormal short term variability is the most important feature, with a relative influence of 21.95.
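The rel.inf values reported by summary.gbm are relative influences expressed as percentages, so they sum to 100; the top four variability-related features together account for roughly 75% of the total influence. A quick check with the values printed above:

```r
# relative influences copied from the summary output above (percentages)
rel_inf <- c(21.950351068, 20.419051435, 16.219250601, 15.975734410,
              6.252649044,  5.143927031,  2.929799121,  2.637718803,
              1.979950300,  1.516021868,  1.472416703,  1.044207516,
              0.684141734,  0.668165018,  0.621640302,  0.418581419,
              0.061195631,  0.005197997,  0, 0, 0)
round(sum(rel_inf), 6)       # 100
round(sum(rel_inf[1:4]), 1)  # 74.6
```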
health_gbm_pred <- predict(health_gbm, health_test[, -22], type = "raw")
confusionMatrix(health_gbm_pred, health_test$fetal_health)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 533 34 8
## 2 9 69 1
## 3 5 1 49
##
## Overall Statistics
##
## Accuracy : 0.9182
## 95% CI : (0.8955, 0.9373)
## No Information Rate : 0.7715
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7673
##
## Mcnemar's Test P-Value : 0.001632
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9744 0.66346 0.84483
## Specificity 0.7407 0.98347 0.99078
## Pos Pred Value 0.9270 0.87342 0.89091
## Neg Pred Value 0.8955 0.94444 0.98624
## Prevalence 0.7715 0.14669 0.08181
## Detection Rate 0.7518 0.09732 0.06911
## Detection Prevalence 0.8110 0.11142 0.07757
## Balanced Accuracy 0.8576 0.82347 0.91781
Fetal health values are predicted with Stochastic Gradient Boosting. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9182, so about 92% of the test set is predicted correctly.
To conclude, the test-set prediction accuracies are:
- Penalized Regression Approaches (PRA): 0.8773
- Decision Trees (DT): 0.9055
- Random Forests (RF): 0.9252
- Stochastic Gradient Boosting (SGB): 0.9182
So, the random forest can be selected as the best predictive model for this dataset.
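Collecting the confusion-matrix accuracies printed above into one table makes the comparison explicit: the random forest attains the highest test-set accuracy. A small sketch:

```r
# test-set accuracies taken from the confusionMatrix outputs above
results <- data.frame(model    = c("PRA", "DT", "RF", "SGB"),
                      accuracy = c(0.8773, 0.9055, 0.9252, 0.9182))
results[which.max(results$accuracy), ]  # model RF, accuracy 0.9252
```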
Description: Leaving credit card services is a crucial problem for banks. Bank managers want to predict who is going to churn so that they can take precautions to prevent it. The target variable is the attrition status of the customer (attrited or existing). The dataset also contains features such as the customer's age, salary, marital status, credit card limit and credit card category.
Tasks: Classification (This dataset has class imbalance with a ratio of 5:1)
Number of observations: 10127
Number of features: 23
Feature characteristics: Integer, real, categorical, ordinal
After reading the dataset, non-predictive features are excluded and data types are updated. Finally, train and test sets are created.
# reading dataset 2
churn_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/BankChurners.csv"))
## Warning: Missing column names filled in: 'X1' [1]
# removing non-predictive features
churn_data <- churn_data[, -c(1, 2, 23, 24)]
# defining data types
# recoding the target: "Attrited Customer" -> 1, "Existing Customer" -> 0
churn_data$Attrition_Flag <- ifelse(churn_data$Attrition_Flag == "Attrited Customer", 1, 0)
churn_data$Attrition_Flag <- as.factor(churn_data$Attrition_Flag)
churn_data$Gender <- as.factor(churn_data$Gender)
churn_data$Marital_Status <- as.factor(churn_data$Marital_Status)
churn_data$Education_Level <- as.factor(churn_data$Education_Level)
churn_data$Income_Category <- as.factor(churn_data$Income_Category)
churn_data$Card_Category <- as.factor(churn_data$Card_Category)
# creating train and test sets for dataset 2
set.seed(6)
churn_index <- sample(1:nrow(churn_data), (2/3) * nrow(churn_data))
churn_train <- churn_data[churn_index, ]
churn_test <- churn_data[-churn_index, ]
paste("Total:", nrow(churn_data), " Train:", nrow(churn_train),
" Test:", nrow(churn_test))
## [1] "Total: 10127 Train: 6751 Test: 3376"
To determine the Lasso penalty parameter, lambda, 10-fold cross-validation is used.
set.seed(7)
churn_cv_fit <- cv.glmnet(data.matrix(churn_train[, -1]), as.matrix(churn_train[,
1]), family = "binomial", nfolds = 10)
churn_cv_fit
##
## Call: cv.glmnet(x = data.matrix(churn_train[, -1]), y = as.matrix(churn_train[, 1]), nfolds = 10, family = "binomial")
##
## Measure: Binomial Deviance
##
## Lambda Measure SE Nonzero
## min 0.000566 0.4745 0.008114 17
## 1se 0.004385 0.4815 0.007770 11
cat(" Lambda min:", churn_cv_fit$lambda.min, "\n", "Lambda 1se:",
churn_cv_fit$lambda.1se)
## Lambda min: 0.0005662878
## Lambda 1se: 0.004384561
plot(churn_cv_fit)
Based on the CV results, the minimum lambda value, 0.000566, is selected for use in the Penalized Regression Approach.
churn_pra <- glmnet(data.matrix(churn_train[, -1]), as.matrix(churn_train[,
1]), family = "binomial", lambda = churn_cv_fit$lambda.min)
churn_pra_pred <- data.frame(predict(churn_pra, data.matrix(churn_test[,
-1]), type = "class"))
confusionMatrix(churn_pra_pred[, 1], churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2735 240
## 1 93 308
##
## Accuracy : 0.9014
## 95% CI : (0.8908, 0.9112)
## No Information Rate : 0.8377
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5933
##
## Mcnemar's Test P-Value : 1.237e-15
##
## Sensitivity : 0.9671
## Specificity : 0.5620
## Pos Pred Value : 0.9193
## Neg Pred Value : 0.7681
## Prevalence : 0.8377
## Detection Rate : 0.8101
## Detection Prevalence : 0.8812
## Balanced Accuracy : 0.7646
##
## 'Positive' Class : 0
##
Attrition values are predicted with the Penalized Regression model. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9014, so about 90% of the test set is predicted correctly.
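Because the churn classes are imbalanced (roughly 5:1), plain accuracy is flattered by the majority class; balanced accuracy, the mean of sensitivity and specificity, is a fairer summary here. Computed from the rounded values printed above (it matches the reported 0.7646 up to rounding of the inputs):

```r
# sensitivity and specificity from the confusionMatrix output above
sensitivity <- 0.9671
specificity <- 0.5620
balanced_accuracy <- (sensitivity + specificity) / 2
balanced_accuracy  # 0.76455, reported as 0.7646
```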
In the Decision Trees, the minimal number of observations per tree leaf and the complexity parameter are tuned with cross-validation.
For the minimal number of observations per tree leaf, 5, 10, 15, 20, 25 and 30 are tried, and for the complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are tried.
set.seed(8)
churn_dt_minbucket <- tune.rpart(Attrition_Flag ~ ., data = churn_train,
minbucket = seq(5, 30, 5))
plot(churn_dt_minbucket, main = "Performance of rpart vs. minbucket")
churn_dt_minbucket$best.parameters$minbucket
## [1] 5
churn_dt_cp <- tune.rpart(Attrition_Flag ~ ., data = churn_train,
cp = seq(0.005, 0.03, 0.005))
plot(churn_dt_cp, main = "Performance of rpart vs. cp")
churn_dt_cp$best.parameters$cp
## [1] 0.005
As the best parameter values, the minimal number of observations per tree leaf is 5 and the complexity parameter is 0.005.
churn_dt <- rpart(Attrition_Flag ~ ., data = churn_train, method = "class",
control = rpart.control(minbucket = churn_dt_minbucket$best.parameters$minbucket,
cp = churn_dt_cp$best.parameters$cp))
fancyRpartPlot(churn_dt)
churn_dt_testpred <- predict(churn_dt, churn_test[, -1], type = "class")
confusionMatrix(churn_dt_testpred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2715 96
## 1 113 452
##
## Accuracy : 0.9381
## 95% CI : (0.9294, 0.946)
## No Information Rate : 0.8377
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7752
##
## Mcnemar's Test P-Value : 0.2684
##
## Sensitivity : 0.9600
## Specificity : 0.8248
## Pos Pred Value : 0.9658
## Neg Pred Value : 0.8000
## Prevalence : 0.8377
## Detection Rate : 0.8042
## Detection Prevalence : 0.8326
## Balanced Accuracy : 0.8924
##
## 'Positive' Class : 0
##
Attrition values are predicted with the Decision Tree model. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9381, so about 94% of the test set is predicted correctly.
set.seed(9)
churn_rf <- randomForest(churn_train[, -1], churn_train$Attrition_Flag,
ntree = 500, nodesize = 5)
churn_rf$mtry
## [1] 4
With the default parameters of Random Forest, each split considers a random sample of 4 features (the default mtry for classification, the floor of the square root of the 19 predictors). For this parameter, 2, 4, 6, 8, 10 and 12 are tried while tuning.
fitControl <- trainControl(method = "repeatedcv", number = 3,
repeats = 2, search = "grid")
tunegrid <- expand.grid(.mtry = seq(2, 12, 2))
churn_rf <- train(Attrition_Flag ~ ., data = churn_train, method = "rf",
metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(churn_rf)
## Random Forest
##
## 6751 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times)
## Summary of sample sizes: 4500, 4501, 4501, 4500, 4501, 4501, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9155686 0.6128091
## 4 0.9448227 0.7738760
## 6 0.9549695 0.8203862
## 8 0.9591173 0.8389089
## 10 0.9600802 0.8441147
## 12 0.9592655 0.8415980
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
plot(churn_rf)
According to the printed results, caret selects mtry = 10 with an accuracy of 0.9601, and mtry = 12 performs almost identically (0.9593); mtry = 12 is used for the final model. As can be seen in the plot, accuracy increases with mtry before leveling off.
churn_rf <- randomForest(churn_train[, -1], churn_train$Attrition_Flag,
ntree = 500, nodesize = 5, mtry = 12)
churn_rf
##
## Call:
## randomForest(x = churn_train[, -1], y = churn_train$Attrition_Flag, ntree = 500, mtry = 12, nodesize = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 3.75%
## Confusion matrix:
## 0 1 class.error
## 0 5586 86 0.0151622
## 1 167 912 0.1547729
The OOB estimate of the error rate is 3.75%. Also, the class error of 0 (existing) is about 1.5% and the class error of 1 (attrited) is about 15%. This difference can be the result of class imbalance in the dataset.
varImpPlot(churn_rf)
According to the variable importance plot, the total transaction value is the most important feature and has the largest effect on the Gini index.
churn_rf_pred <- predict(churn_rf, churn_test[, -1], type = "class")
confusionMatrix(churn_rf_pred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2779 66
## 1 49 482
##
## Accuracy : 0.9659
## 95% CI : (0.9593, 0.9718)
## No Information Rate : 0.8377
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8732
##
## Mcnemar's Test P-Value : 0.1357
##
## Sensitivity : 0.9827
## Specificity : 0.8796
## Pos Pred Value : 0.9768
## Neg Pred Value : 0.9077
## Prevalence : 0.8377
## Detection Rate : 0.8232
## Detection Prevalence : 0.8427
## Balanced Accuracy : 0.9311
##
## 'Positive' Class : 0
##
Attrition values are predicted with Random Forest. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9659, so about 97% of the test set is predicted correctly. The main error results from predicting attrited customers as existing customers, due to class imbalance.
In the Stochastic Gradient Boosting, the depth of the trees, the learning rate and the number of trees are tuned with cross-validation.
For the tree depth, 1, 2 and 3 are tried, for the learning rate 0.001, 0.005 and 0.01 are tried, and for the number of trees 50, 100 and 150 are tried. The minimal number of observations per tree leaf is held at 10.
set.seed(10)
fitControl <- trainControl(method = "repeatedcv", number = 5,
repeats = 3, verboseIter = FALSE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001,
0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- capture.output(churn_gbm <- train(Attrition_Flag ~
., data = churn_train, method = "gbm", trControl = fitControl,
tuneGrid = tunegrid))
print(churn_gbm)
## Stochastic Gradient Boosting
##
## 6751 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 5402, 5401, 5400, 5400, 5401, 5401, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees Accuracy Kappa
## 0.001 1 50 0.8401719 0.00000000
## 0.001 1 100 0.8401719 0.00000000
## 0.001 1 150 0.8401719 0.00000000
## 0.001 2 50 0.8401719 0.00000000
## 0.001 2 100 0.8401719 0.00000000
## 0.001 2 150 0.8401719 0.00000000
## 0.001 3 50 0.8401719 0.00000000
## 0.001 3 100 0.8401719 0.00000000
## 0.001 3 150 0.8401719 0.00000000
## 0.005 1 50 0.8401719 0.00000000
## 0.005 1 100 0.8401719 0.00000000
## 0.005 1 150 0.8401719 0.00000000
## 0.005 2 50 0.8401719 0.00000000
## 0.005 2 100 0.8401719 0.00000000
## 0.005 2 150 0.8482690 0.08218471
## 0.005 3 50 0.8401719 0.00000000
## 0.005 3 100 0.8401719 0.00000000
## 0.005 3 150 0.8590333 0.18983478
## 0.010 1 50 0.8401719 0.00000000
## 0.010 1 100 0.8401719 0.00000000
## 0.010 1 150 0.8477768 0.08012661
## 0.010 2 50 0.8401719 0.00000000
## 0.010 2 100 0.8754253 0.33338847
## 0.010 2 150 0.8959658 0.50276443
## 0.010 3 50 0.8401719 0.00000000
## 0.010 3 100 0.8856949 0.41244052
## 0.010 3 150 0.9104330 0.59221673
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(churn_gbm)
According to the results, the final values used for the model are n.trees = 150, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 10. As can be seen in the plot, choosing a shrinkage of 0.01 makes a large difference in accuracy.
churn_gbm_pred <- predict(churn_gbm, churn_test[, -1], type = "raw")
confusionMatrix(churn_gbm_pred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2796 283
## 1 32 265
##
## Accuracy : 0.9067
## 95% CI : (0.8964, 0.9163)
## No Information Rate : 0.8377
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5792
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9887
## Specificity : 0.4836
## Pos Pred Value : 0.9081
## Neg Pred Value : 0.8923
## Prevalence : 0.8377
## Detection Rate : 0.8282
## Detection Prevalence : 0.9120
## Balanced Accuracy : 0.7361
##
## 'Positive' Class : 0
##
Attrition values are predicted with Stochastic Gradient Boosting. Then, a confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.9067, so about 91% of the test set is predicted correctly. Again, the main error results from predicting attrited customers as existing customers, due to class imbalance.
To conclude, the test-set prediction accuracies are:
- Penalized Regression Approaches (PRA): 0.9014
- Decision Trees (DT): 0.9381
- Random Forests (RF): 0.9659
- Stochastic Gradient Boosting (SGB): 0.9067
So, the random forest can be selected as the best predictive model for this dataset.
Description: 1000 sports articles were labeled as objective or subjective using Amazon Mechanical Turk. The main aim is to predict an article's objectivity by investigating the usage of nouns, adjectives, adverbs, symbols, etc. The target variable is the label of the article (objective/subjective).
Tasks: Classification
Number of observations: 1000
Number of features: 62
Feature characteristics: Integer
After reading the dataset, non-predictive features are excluded and data types are updated. Finally, train and test sets are created.
# reading dataset 3
article_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/SportsArticles.csv"))
## Warning: Missing column names filled in: 'X1' [1]
# removing non-predictive features (X1, TextID and URL)
article_data <- article_data[, -c(1, 2, 3)]
# defining data types
article_data$Label <- as.factor(article_data$Label)
# creating train and test sets for dataset 3
set.seed(11)
article_index <- sample(1:nrow(article_data), (2/3) * nrow(article_data))
article_train <- article_data[article_index, ]
article_test <- article_data[-article_index, ]
paste("Total:", nrow(article_data), " Train:", nrow(article_train),
" Test:", nrow(article_test))
## [1] "Total: 1000 Train: 666 Test: 334"
For determining the Lasso penalty, lambda, 10-fold cross-validation is used.
set.seed(12)
article_cv_fit <- cv.glmnet(data.matrix(article_train[, -1]),
as.matrix(article_train[, 1]), family = "binomial", nfolds = 10)
article_cv_fit
##
## Call: cv.glmnet(x = data.matrix(article_train[, -1]), y = as.matrix(article_train[, 1]), nfolds = 10, family = "binomial")
##
## Measure: Binomial Deviance
##
## Lambda Measure SE Nonzero
## min 0.00388 0.9166 0.06775 41
## 1se 0.04355 0.9812 0.03767 13
cat(" Lambda min:", article_cv_fit$lambda.min, "\n", "Lambda 1se:",
article_cv_fit$lambda.1se)
## Lambda min: 0.00387694
## Lambda 1se: 0.0435506
plot(article_cv_fit)
After the CV results, the minimum lambda value 0.00388 is selected for use in the Penalized Regression Approach.
article_pra <- glmnet(data.matrix(article_train[, -1]), as.matrix(article_train[,
1]), family = "binomial", lambda = article_cv_fit$lambda.min)
article_pra_pred <- data.frame(predict(article_pra, data.matrix(article_test[,
-1]), type = "class"))
article_pra_pred$s0 <- as.factor(article_pra_pred$s0)
confusionMatrix(article_pra_pred$s0, article_test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction objective subjective
## objective 192 39
## subjective 20 83
##
## Accuracy : 0.8234
## 95% CI : (0.7781, 0.8627)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : 3.137e-14
##
## Kappa : 0.606
##
## Mcnemar's Test P-Value : 0.01911
##
## Sensitivity : 0.9057
## Specificity : 0.6803
## Pos Pred Value : 0.8312
## Neg Pred Value : 0.8058
## Prevalence : 0.6347
## Detection Rate : 0.5749
## Detection Prevalence : 0.6916
## Balanced Accuracy : 0.7930
##
## 'Positive' Class : objective
##
Labels of the articles are predicted with the Penalized Regression Model. Then, the confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.8234, so about 82% of the test set is predicted correctly.
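The accuracy reported by confusionMatrix (0.8234) can be verified directly from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test observations. A small check with the counts reported above:

```r
# accuracy = correct predictions on the diagonal / total predictions
cm <- matrix(c(192, 20, 39, 83), nrow = 2,
             dimnames = list(Prediction = c("objective", "subjective"),
                             Reference  = c("objective", "subjective")))
sum(diag(cm)) / sum(cm)
## [1] 0.8233533
```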
In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 10, 15, 20, 25 and 30 are used, and for complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are used.
set.seed(13)
article_dt_minbucket <- tune.rpart(Label ~ ., data = article_train,
minbucket = seq(5, 30, 5))
plot(article_dt_minbucket, main = "Performance of rpart vs. minbucket")
article_dt_minbucket$best.parameters$minbucket
## [1] 5
article_dt_cp <- tune.rpart(Label ~ ., data = article_train,
cp = seq(0.005, 0.03, 0.005))
plot(article_dt_cp, main = "Performance of rpart vs. cp")
article_dt_cp$best.parameters$cp
## [1] 0.03
As the best parameter values, the minimal number of observations per tree leaf takes the value of 5 and the complexity parameter takes the value of 0.03.
article_dt <- rpart(Label ~ ., data = article_train, method = "class",
control = rpart.control(minbucket = article_dt_minbucket$best.parameters$minbucket,
cp = article_dt_cp$best.parameters$cp))
fancyRpartPlot(article_dt)
article_dt_pred <- predict(article_dt, article_test[, -1], type = "class")
confusionMatrix(article_dt_pred, article_test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction objective subjective
## objective 165 28
## subjective 47 94
##
## Accuracy : 0.7754
## 95% CI : (0.7269, 0.8191)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : 2.173e-08
##
## Kappa : 0.5312
##
## Mcnemar's Test P-Value : 0.03767
##
## Sensitivity : 0.7783
## Specificity : 0.7705
## Pos Pred Value : 0.8549
## Neg Pred Value : 0.6667
## Prevalence : 0.6347
## Detection Rate : 0.4940
## Detection Prevalence : 0.5778
## Balanced Accuracy : 0.7744
##
## 'Positive' Class : objective
##
Labels of the articles are predicted with the Decision Tree model. Then, the confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.7754, so about 78% of the test set is predicted correctly.
set.seed(14)
article_rf <- randomForest(article_train[, -1], article_train$Label,
ntree = 500, nodesize = 5)
article_rf$mtry
## [1] 7
With the default parameters of Random Forest, each split in a tree considers a random sample of 7 features. For that parameter (mtry), the values 5, 7, 9, 11, 13 and 15 are tried while tuning.
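The default of 7 comes from randomForest's rule of thumb for classification, mtry = floor(sqrt(p)), where p is the number of predictors (59 here):

```r
# default mtry for classification: floor(sqrt(number of predictors))
p <- 59
floor(sqrt(p))
## [1] 7
```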
fitControl <- trainControl(method = "repeatedcv", number = 5,
repeats = 3, search = "grid")
tunegrid <- expand.grid(.mtry = seq(5, 15, 2))
article_rf <- train(Label ~ ., data = article_train, method = "rf",
metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(article_rf)
## Random Forest
##
## 666 samples
## 59 predictor
## 2 classes: 'objective', 'subjective'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 532, 533, 533, 533, 533, 533, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 5 0.8182595 0.6030295
## 7 0.8187343 0.6054435
## 9 0.8167254 0.6006689
## 11 0.8187456 0.6054645
## 13 0.8202531 0.6090822
## 15 0.8157455 0.5991784
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
plot(article_rf)
According to the accuracy values, caret selects mtry = 13 (CV accuracy 0.8203). However, the accuracies are nearly identical across the grid, and mtry = 9 (CV accuracy 0.8167) is used for the final model.
article_rf <- randomForest(article_train[, -1], article_train$Label,
ntree = 500, nodesize = 5, mtry = 9)
article_rf
##
## Call:
## randomForest(x = article_train[, -1], y = article_train$Label, ntree = 500, mtry = 9, nodesize = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 18.92%
## Confusion matrix:
## objective subjective class.error
## objective 359 64 0.1513002
## subjective 62 181 0.2551440
The OOB estimate of the error rate is 18.92%. Also, the class error of objective is about 15% and the class error of subjective is about 26%.
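The class errors in the OOB confusion matrix are simply the off-diagonal counts divided by the row totals, which can be checked against the counts printed above:

```r
# class error = misclassified observations / total observations per class
64 / (359 + 64)    # objective
## [1] 0.1513002
62 / (62 + 181)    # subjective
## [1] 0.255144
```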
varImpPlot(article_rf)
According to the variable importance plot, LS (frequency of list item markers) and PRP$ (frequency of possessive pronouns) are the most important features; they have the largest effect on the mean decrease in the Gini index.
article_rf_pred <- predict(article_rf, article_test[, -1], type = "class")
confusionMatrix(article_rf_pred, article_test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction objective subjective
## objective 176 26
## subjective 36 96
##
## Accuracy : 0.8144
## 95% CI : (0.7684, 0.8546)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : 5.622e-13
##
## Kappa : 0.6065
##
## Mcnemar's Test P-Value : 0.253
##
## Sensitivity : 0.8302
## Specificity : 0.7869
## Pos Pred Value : 0.8713
## Neg Pred Value : 0.7273
## Prevalence : 0.6347
## Detection Rate : 0.5269
## Detection Prevalence : 0.6048
## Balanced Accuracy : 0.8085
##
## 'Positive' Class : objective
##
Labels of the articles are predicted with Random Forest. Then, the confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.8144, so about 81% of the test set is predicted correctly.
In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.
set.seed(15)
fitControl <- trainControl(method = "repeatedcv", number = 5,
repeats = 3, verboseIter = FALSE, summaryFunction = twoClassSummary,
classProbs = TRUE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001,
0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- suppressWarnings(capture.output(article_gbm <- train(Label ~
., data = article_train, method = "gbm", trControl = fitControl,
tuneGrid = tunegrid)))
print(article_gbm)
## Stochastic Gradient Boosting
##
## 666 samples
## 59 predictor
## 2 classes: 'objective', 'subjective'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 532, 533, 533, 533, 533, 532, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees ROC Sens Spec
## 0.001 1 50 0.8472028 1.0000000 0.0000000
## 0.001 1 100 0.8487066 1.0000000 0.0000000
## 0.001 1 150 0.8493557 1.0000000 0.0000000
## 0.001 2 50 0.8525256 1.0000000 0.0000000
## 0.001 2 100 0.8537802 1.0000000 0.0000000
## 0.001 2 150 0.8543623 1.0000000 0.0000000
## 0.001 3 50 0.8563527 1.0000000 0.0000000
## 0.001 3 100 0.8573803 1.0000000 0.0000000
## 0.001 3 150 0.8573920 1.0000000 0.0000000
## 0.005 1 50 0.8469414 1.0000000 0.0000000
## 0.005 1 100 0.8523878 0.9338936 0.4827381
## 0.005 1 150 0.8509636 0.8960691 0.5924887
## 0.005 2 50 0.8546853 1.0000000 0.0000000
## 0.005 2 100 0.8567238 0.9354715 0.4896825
## 0.005 2 150 0.8584464 0.8960504 0.6146259
## 0.005 3 50 0.8586486 1.0000000 0.0000000
## 0.005 3 100 0.8596539 0.9448833 0.4870465
## 0.005 3 150 0.8610910 0.9070588 0.6062642
## 0.010 1 50 0.8478035 0.9354342 0.4744331
## 0.010 1 100 0.8513792 0.8802988 0.6405896
## 0.010 1 150 0.8521180 0.8676751 0.6640306
## 0.010 2 50 0.8580020 0.9417460 0.4925454
## 0.010 2 100 0.8601241 0.8881793 0.6599206
## 0.010 2 150 0.8629500 0.8787208 0.6996315
## 0.010 3 50 0.8597358 0.9472456 0.4938776
## 0.010 3 100 0.8627375 0.8921289 0.6612812
## 0.010 3 150 0.8653259 0.8795145 0.7010488
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(article_gbm)
According to the results, the final values used for the model are n.trees = 150, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 10.
article_gbm_pred <- predict(article_gbm, article_test[, -1],
type = "raw")
confusionMatrix(article_gbm_pred, article_test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction objective subjective
## objective 183 29
## subjective 29 93
##
## Accuracy : 0.8263
## 95% CI : (0.7814, 0.8654)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : 1.152e-14
##
## Kappa : 0.6255
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8632
## Specificity : 0.7623
## Pos Pred Value : 0.8632
## Neg Pred Value : 0.7623
## Prevalence : 0.6347
## Detection Rate : 0.5479
## Detection Prevalence : 0.6347
## Balanced Accuracy : 0.8128
##
## 'Positive' Class : objective
##
Labels of the articles are predicted with Stochastic Gradient Boosting. Then, the confusion matrix is used to compare the actual and predicted values in the test set. According to the confusion matrix, the accuracy of the prediction is 0.8263, so about 83% of the test set is predicted correctly. Here, the misclassifications are balanced between the two classes (29 in each direction).
To conclude, the accuracy values of the predictions on the test set are:
- Penalized Regression Approaches (PRA): 0.8234
- Decision Trees (DT): 0.7754
- Random Forests (RF): 0.8144
- Stochastic Gradient Boosting (SGB): 0.8263
So, stochastic gradient boosting can be selected as the best predictive model for this dataset compared to the others. However, PRA, RF and SGB have nearly the same accuracy values, so additional improvements that can increase accuracy may be beneficial.
Description: This dataset contains data about superconductors and their relevant features. The aim is predicting the critical temperature of a superconductor.
Tasks: Regression
Number of observations: 21263
Number of features: 81
Feature characteristics: Real
After reading the dataset, train and test sets are created. Originally, this dataset had 21263 observations. However, this analysis is made with a random sample of 5000 observations due to long run times in R.
# reading dataset 4
cond_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/superconductor.csv"))
set.seed(16)
cond_index <- sample(1:nrow(cond_data), 5000)
cond_data <- cond_data[cond_index, ]
# creating train and test sets for dataset 4
set.seed(17)
cond_index <- sample(1:nrow(cond_data), (2/3) * nrow(cond_data))
cond_train <- cond_data[cond_index, ]
cond_test <- cond_data[-cond_index, ]
paste("Total:", nrow(cond_data), " Train:", nrow(cond_train),
" Test:", nrow(cond_test))
## [1] "Total: 5000 Train: 3333 Test: 1667"
set.seed(18)
cond_cv_fit <- cv.glmnet(data.matrix(cond_train[, -82]), data.matrix(cond_train[,
82]), family = "gaussian", nfolds = 10)
cond_cv_fit
##
## Call: cv.glmnet(x = data.matrix(cond_train[, -82]), y = data.matrix(cond_train[, 82]), nfolds = 10, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.00240 316.2 11.12 78
## 1se 0.03567 326.5 10.88 62
cat(" Lambda min:", cond_cv_fit$lambda.min, "\n", "Lambda 1se:",
cond_cv_fit$lambda.1se)
## Lambda min: 0.002402019
## Lambda 1se: 0.03566921
plot(cond_cv_fit)
After the CV results, the minimum lambda value 0.0024 is selected for use in the Penalized Regression Approach.
cond_pra <- glmnet(data.matrix(cond_train[, -82]), cond_train$critical_temp,
family = "gaussian", lambda = cond_cv_fit$lambda.min)
cond_pra_pred <- data.frame(predict(cond_pra, data.matrix(cond_test[,
-82])))
colnames(cond_pra_pred) <- "s0"
rmse(cond_test$critical_temp, cond_pra_pred$s0)
## [1] 17.68855
Critical temperature values are predicted with the Penalized Regression Approach. Then, RMSE (Root Mean Squared Error) values are used for comparing the models. The RMSE value of the Penalized Regression Approach is 17.68855.
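Metrics::rmse computes the square root of the mean squared difference between the actual and predicted values; a minimal illustration with made-up numbers:

```r
# RMSE by hand, equivalent to Metrics::rmse(actual, predicted)
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
sqrt(mean((actual - predicted)^2))
## [1] 2.380476
```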
In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 6 and 7 are used, and for complexity parameter 0.005, 0.01 and 0.015 are used.
set.seed(19)
cond_dt_minbucket <- tune.rpart(critical_temp ~ ., data = cond_train,
minbucket = c(5, 6, 7))
plot(cond_dt_minbucket, main = "Performance of rpart vs. minbucket")
cond_dt_minbucket$best.parameters$minbucket
## [1] 5
cond_dt_cp <- tune.rpart(critical_temp ~ ., data = cond_train,
cp = c(0.005, 0.01, 0.015))
plot(cond_dt_cp, main = "Performance of rpart vs. cp")
cond_dt_cp$best.parameters$cp
## [1] 0.005
As the best parameter value, the minimal number of observations per tree leaf takes the value of 5 and complexity parameter takes the value of 0.005.
cond_dt <- rpart(critical_temp ~ ., data = cond_train, method = "anova",
control = rpart.control(minbucket = cond_dt_minbucket$best.parameters$minbucket,
cp = cond_dt_cp$best.parameters$cp))
fancyRpartPlot(cond_dt)
cond_dt_pred <- data.frame(predict(cond_dt, cond_test[, -82]))
colnames(cond_dt_pred) <- "s0"
rmse(cond_test$critical_temp, cond_dt_pred$s0)
## [1] 17.41911
Critical temperature values are predicted with the Decision Tree model. Then, RMSE values are used for comparing the models. The RMSE value of the Decision Tree is 17.41911.
set.seed(20)
cond_rf <- randomForest(data.matrix(cond_train[, -82]), cond_train$critical_temp,
ntree = 500, nodesize = 5)
cond_rf$mtry
## [1] 27
With the default parameters of Random Forest, each split in a tree considers a random sample of 27 features. For that parameter (mtry), the values 25, 27 and 29 are tried while tuning.
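For regression, randomForest's default is mtry = max(floor(p / 3), 1), where p is the number of predictors; with the 81 predictors here this gives 27, matching the value reported above:

```r
# default mtry for regression: max(floor(number of predictors / 3), 1)
p <- 81
max(floor(p / 3), 1)
## [1] 27
```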
fitControl <- trainControl(method = "repeatedcv", number = 3,
repeats = 2, search = "grid")
# note: seq(25, 27, 29) would yield only the single value 25
tunegrid <- expand.grid(.mtry = c(25, 27, 29))
cond_rf <- train(critical_temp ~ ., data = cond_train, method = "rf",
trControl = fitControl, tuneGrid = tunegrid)
print(cond_rf)
## Random Forest
##
## 3333 samples
## 81 predictor
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times)
## Summary of sample sizes: 2223, 2221, 2222, 2223, 2222, 2221, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 12.07697 0.8711659 7.61113
##
## Tuning parameter 'mtry' was held constant at a value of 25
According to the RMSE values, mtry is selected as 25.
cond_rf <- randomForest(cond_train[, -82], cond_train$critical_temp,
ntree = 500, nodesize = 5, mtry = 25)
varImpPlot(cond_rf)
According to the variable importance plot, the range of thermal conductivity is the most important feature; it has the largest effect on node purity.
cond_rf_pred <- data.frame(predict(cond_rf, cond_test[, -82]))
colnames(cond_rf_pred) <- "s0"
rmse(cond_test$critical_temp, cond_rf_pred$s0)
## [1] 11.7012
Critical temperature values are predicted with Random Forest. Then, RMSE values are used for comparing the models. The RMSE value of the Random Forest is 11.7012.
In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.
set.seed(15)
fitControl <- trainControl(method = "repeatedcv", number = 5,
repeats = 3, verboseIter = FALSE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001,
0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- suppressWarnings(capture.output(cond_gbm <- train(critical_temp ~
., data = cond_train, method = "gbm", trControl = fitControl,
tuneGrid = tunegrid)))
print(cond_gbm)
## Stochastic Gradient Boosting
##
## 3333 samples
## 81 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 2666, 2667, 2666, 2665, 2668, 2665, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.001 1 50 32.62523 0.5533129 27.67605
## 0.001 1 100 31.84922 0.5748501 26.97639
## 0.001 1 150 31.11919 0.5890751 26.31968
## 0.001 2 50 32.42004 0.6559535 27.47111
## 0.001 2 100 31.44503 0.6576384 26.58487
## 0.001 2 150 30.53580 0.6599337 25.75881
## 0.001 3 50 32.36867 0.6954543 27.43889
## 0.001 3 100 31.34418 0.6971619 26.52150
## 0.001 3 150 30.38621 0.6991506 25.66557
## 0.005 1 50 29.77755 0.6085155 25.10161
## 0.005 1 100 27.04066 0.6333489 22.56833
## 0.005 1 150 24.98056 0.6438804 20.59398
## 0.005 2 50 28.88745 0.6642530 24.25788
## 0.005 2 100 25.64711 0.6778480 21.29254
## 0.005 2 150 23.31911 0.6920473 19.11261
## 0.005 3 50 28.64215 0.7012131 24.10565
## 0.005 3 100 25.17344 0.7146197 20.99371
## 0.005 3 150 22.64269 0.7271795 18.69022
## 0.010 1 50 27.02805 0.6333693 22.55584
## 0.010 1 100 23.45461 0.6496371 19.05223
## 0.010 1 150 21.47307 0.6621218 16.95981
## 0.010 2 50 25.63618 0.6782711 21.28608
## 0.010 2 100 21.62097 0.7035182 17.48365
## 0.010 2 150 19.45361 0.7228087 15.37634
## 0.010 3 50 25.15205 0.7128403 20.97529
## 0.010 3 100 20.75616 0.7376875 16.91859
## 0.010 3 150 18.43074 0.7532463 14.64805
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(cond_gbm)
According to the results, the final values used for the model are n.trees = 150, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 10.
cond_gbm_pred <- data.frame(predict(cond_gbm, cond_test[, -82]))
colnames(cond_gbm_pred) <- "s0"
rmse(cond_test$critical_temp, cond_gbm_pred$s0)
## [1] 11.7012
Critical temperatures are predicted with Stochastic Gradient Boosting. Then, RMSE values are used for comparing the models. The RMSE value of Stochastic Gradient Boosting is 11.7012.
To conclude, the RMSE values of the predictions on the test set are:
- Penalized Regression Approaches (PRA): 17.68855
- Decision Trees (DT): 17.41911
- Random Forests (RF): 11.7012
- Stochastic Gradient Boosting (SGB): 11.7012
So, Random Forest or Stochastic Gradient Boosting can be selected as the best predictive model for this dataset, since they share the minimum RMSE value.
In the classification problems, the random forest and stochastic gradient boosting models achieved the highest accuracy values. In the regression problem, the random forest and stochastic gradient boosting models gave the same RMSE values. Overall, it can be said that random forest and stochastic gradient boosting give better prediction results than the penalized regression approach and decision trees.